1 Introduction


Basic  tasks  in  autonomous  robot  navigation  are  localization  and  positioning.  Localization  is  the 
act  of  recognizing  the  environment,  that  is,  assigning  consistent  labels  to  different  locations,  and 
positioning  is  the  act  of  computing  the  coordinates  of  the  robot  in  the  environment.  Positioning 
is a task complementary to localization, in the sense that position (e.g., “1.5 meters northwest
of table T”) is often specified in a place-specific coordinate system (“in room 911”). In this
paper  we  suggest  a  method  of  both  localization  and  positioning  using  vision  alone.  A  variant 
of  the  positioning  problem,  referred  to  as  repositioning,  involving  the  return  to  a  previously 
visited  place  is  also  discussed. 

Previous  studies  have  examined  the  problems  of  localization  and  positioning  under  a  variety 
of  conditions,  defined  by  the  kind  of  sensor(s)  employed,  the  nature  of  the  environment,  and 
the  representations  used.  We  can  distinguish  between  active  and  passive  sensing,  indoor  and 
outdoor  navigation  tasks,  and  metric  and  topological  representations.  The  metric  approach 
attempts  to  utilize  a  detailed  geometric  description  of  the  environment,  while  the  topological 
approach  uses  a  more  qualitative  description  including  a  graph  with  nodes  representing  places 
and  arcs  representing  sequences  of  actions  that  would  result  in  moving  the  robot  from  one  node 
to  another. 

In  the  paper  we  consider  a  robot  that  uses  a  passive  sensor,  vision,  in  an  indoor  environment. 
The  environment  cannot  be  changed  by  the  robot  to  improve  its  performance:  neither  beacons 
nor  floor  or  wall  markings  are  employed.  The  paper  addresses  both  the  localization  and  the 
positioning  problems.  Solutions  to  these  problems  are  presented  based  on  object  recognition 
techniques.  The  method,  based  on  the  linear  combinations  scheme  of  [17],  represents  scenes 
by  sets  of  their  2D  images.  Localization  is  achieved  by  comparing  the  observed  image  to 
linear  combinations  of  model  views,  and  the  position  of  the  robot  is  computed  by  analyzing 
the coefficients of the linear combination that aligns the model to the image. Also, a simple,
“qualitative” solution to the repositioning problem using the linear combinations scheme is
presented. 

The  rest  of  the  paper  is  organized  as  follows.  The  next  section  describes  the  localization  and 
positioning  problems  and  surveys  previous  solutions.  The  method  of  localization  and  positioning 
using  linear  combinations  of  model  views  is  described  in  Section  3.  The  method  assumes  weak 
perspective  projection.  An  iterative  scheme  to  account  for  perspective  distortions  is  presented 
in Section 4. An analysis of the error resulting from the projection assumption is presented in
Section 5. Constraints imposed on the motion of the robot as a result of special properties of
indoor environments can be used to reduce the complexity of the method presented here. This
topic is covered in Section 6. Experimental results follow.

2 The Problem


Localization and positioning from visual input are defined in the following way: Given a familiar
environment, identify the observed environment, and then find your position in that
environment.  Localization  resembles  the  task  of  object  recognition,  with  objects  replaced  by 
scenes.  Once  localization  is  accomplished,  positioning  can  be  performed. 

One  problem  a  system  for  localization  and  positioning  should  address  is  the  variability  of 
images  due  to  viewpoint  changes.  The  inexactness  of  practical  systems  makes  it  difficult  for  a 
robot  to  return  to  a  specified  position  on  subsequent  visits.  The  visual  data  available  to  the 
robot  between  visits  varies  in  accordance  with  the  viewing  position  of  the  robot.  A  localization 
system  should  be  able  to  recognize  scenes  from  different  positions  and  orientations. 

Another  problem  is  that  of  changes  in  the  scene.  At  subsequent  visits  the  same  place  may 
look  different  due  to  changes  in  the  arrangement  of  the  objects,  the  introduction  of  new  objects, 
and  the  removal  of  others.  In  general,  some  objects  tend  to  be  more  static  than  others.  While 
chairs  and  books  are  often  moved,  tables,  closets,  and  pictures  tend  to  change  their  position 
much  less,  and  walls  are  almost  guaranteed  to  be  static.  Static  cues  naturally  are  more  reliable 
than  mobile  ones.  Confining  the  system  to  static  cues,  however,  may  in  some  cases  result  in 
failure  to  recognize  the  scene  due  to  insufficient  cues.  The  system  should  therefore  attempt  to 
rely  on  static  cues,  but  should  not  ignore  the  dynamic  cues. 

Solutions  to  the  problem  of  localization  from  visual  data  require  a  large  memory  and  heavy 
computation.  Existing  systems  often  try  to  reduce  this  cost  by  using  sparse  representations 
and by exploiting contextual information. Sparse representations are introduced in [10, 14].
Mataric  [10]  represents  scenes  as  sequences  of  landmarks  (such  as  walls,  doors,  etc.)  extracted 
by  tracing  the  boundaries  of  the  scene  using  a  sonar  and  a  compass.  Metric  information  of 
and between the landmarks is not stored. Sarachik [14] recognizes a room by its dimensions,
which  are  measured  by  identifying  and  locating  the  top  corners  of  the  room  using  stereo  data 
(obtained  from  four  cameras).  In  both  cases  the  representation  is  very  sparse,  and  the  scene  is 
therefore  often  ambiguous. 

Richer representations are used in [2, 4], where higher success rates are reported. Braunegg
[2]  represents  the  scene  by  an  occupancy  table,  a  2D  bit  array  which  contains  a  1  at  every 
location  occupied  by  some  object.  The  table  is  constructed  by  taking  stereo  pictures  covering 
360°  from  the  middle  of  the  room  and  projecting  the  obtained  3D  data  onto  the  floor.  The 
method  suffers  from  loss  of  information  due  to  the  projection  onto  the  floor. 
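As a rough illustration of the occupancy-table idea (not Braunegg's implementation), the following Python sketch projects 3D points onto the floor plane and marks the occupied cells of a 2D bit array. The function name, the cell size, and the grid dimensions are assumptions made for this example only.

```python
def occupancy_table(points, cell_size, width, depth):
    """Project 3D points (x, y, z) onto the floor (x, z) and mark occupied cells."""
    grid = [[0] * width for _ in range(depth)]
    for x, _height, z in points:      # the height is discarded by the projection
        col = int(x // cell_size)
        row = int(z // cell_size)
        if 0 <= row < depth and 0 <= col < width:
            grid[row][col] = 1
    return grid

# Two points at different heights above the same floor cell, plus one other point.
points = [(0.2, 1.5, 0.3), (0.2, 0.1, 0.3), (1.7, 0.9, 2.4)]
table = occupancy_table(points, cell_size=1.0, width=3, depth=3)
```

The first two points fall into the same cell, so their difference in height is no longer visible in the table; this is the loss of information mentioned above.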

Engelson et al. [4] represent the scene by a set of invariant “signatures”. A signature is
usually composed of low-resolution gray-level or range data obtained by blurring an image. A
set of signatures taken from different viewpoints is stored. A scene is recognized if the robot
encounters  a  signature  similar  to  one  of  the  stored  signatures. 

Systems  that  use  the  full  information  provided  by  the  image  (e.g.,  [6,  12])  usually  rely 
on  contextual  information  to  avoid  scanning  all  the  models  in  the  memory  and  to  reduce  the 
computational  cost  of  comparing  a  model  to  the  image.  The  system  follows  a  predetermined 
path, so that the identity of each visited location is known in advance, and localization becomes
a verification problem. Path continuity in many cases is essential, and the so-called “drop-off”
problem  is  not  addressed.  The  emphasis  in  these  systems  is  on  positioning,  which  is  used  to 
keep  the  robot  on  the  path.  It  is  typical  for  these  systems  (e.g.,  [5,  6,  12])  to  use  a  full  3D 
model  of  the  environment. 

Onoguchi et al. [12], among others, represent the environment by a set of landmarks selected
from pairs of stereo images by a human operator. These landmarks are transformed by an image
processing program which is designed so as to identify the specific landmark using specific
extraction instructions (such as what features to look for and at what locations). Localization
is achieved by applying the extraction procedure specified for the next landmark. Once a
landmark  is  identified,  the  position  of  the  robot  relative  to  that  landmark  is  determined  by 
comparing  the  dimensions  of  the  observed  landmark  with  those  of  the  stored  model. 

The  method  presented  in  this  paper  represents  the  environment  using  a  set  of  edge  maps. 
Localization  and  positioning  are  achieved  by  comparing  images  of  the  environment  to  linear 
combinations  of  the  model  views.  The  method  uses  rich  visual  information  to  represent  the 
scene. The system is flexible: in many cases it is capable of recognizing its location from
one image only (360° coverage is not required). When one image is not sufficient, additional
images can be acquired to solve the localization problem. Context can be used to determine
the order of comparison of the models to the observed image and to increase the confidence of
a given match, but context is not essential: the system can also, by performing more extensive
computations, solve the “drop-off” problem.


3  The  Method 

The  problems  of  localization  and  object  recognition  are  similar  in  many  ways.  Both  problems 
require the matching of visual images to stored models, either of the environment or of the
observed objects. Both problems face similar difficulties, such as varying illumination conditions
and  changes  in  appearance  due  to  viewpoint  changes.  Similar  methodologies  therefore  can  be 
used  for  solving  both  problems. 

A particular application of an object recognition scheme, the Linear Combinations (LC)
scheme [17], to the problems of localization and positioning is discussed below. The environment
is represented in this scheme by a small set of views obtained from different viewpoints and by
the correspondence between the views. A novel view is recognized by comparing it to linear
combinations  of  the  stored  views.  Positioning  is  achieved  by  recovering  the  position  of  the 
camera  relative  to  its  position  in  the  model  views  from  the  coefficients  of  the  aligning  linear 
combination.  In  the  rest  of  this  section  we  review  the  linear  combinations  approach  and  describe 
its application to both localization and positioning. The section concludes with a solution to
the problem of repositioning, that is, the problem of returning to a previously visited position
by “locking” onto an image acquired in that position.
3.1 Localization


The problem of localization is defined as follows: given $P$, a 2D image of a place, and $\mathcal{M}$, a set of
stored models, find a model $M^i \in \mathcal{M}$ such that $P$ matches $M^i$. Localization is the recognition
of  a  place.  It  can  therefore  potentially  benefit  from  using  an  object  recognition  methodology. 
A  common  approach  to  handling  the  problem  of  recognition  from  different  viewpoints  is  by 
comparing  the  stored  models  to  the  observed  environment  after  the  viewpoint  is  recovered  and 
compensated  for.  This  approach,  called  alignment,  is  used  in  a  number  of  studies  of  object 
recognition  [1,  7,  8,  9,  15,  16].  We  apply  the  alignment  approach  to  the  problem  of  localization. 
The  system  described  below  uses  the  “Linear  Combinations”  (LC)  scheme,  which  was  suggested 
by  Ullman  and  Basri  [17]. 

We begin with a brief review of the LC scheme. LC is defined as follows. Given an image, we
construct two view vectors from the feature points in the image: one contains the x-coordinates
of the points, and the other contains the y-coordinates of the points. An object (in our case,
the  environment)  is  modeled  by  a  set  of  such  views,  where  the  points  in  these  views  are  ordered 
in  correspondence.  The  appearance  of  a  novel  view  of  the  object  is  predicted  by  applying 
linear  combinations  to  the  stored  views.  The  predicted  appearance  is  then  compared  with  the 
actual  image,  and  the  object  is  recognized  if  the  two  match.  The  advantage  of  this  method 
is twofold. First, viewer-centered representations are used rather than object-centered ones;
namely, models are composed of 2D views of the observed scene. Second, novel appearances are
predicted  in  a  simple  and  accurate  way  (under  weak  perspective  projection). 

Formally, given $P$, a 2D image of a scene, and $\mathcal{M}$, a set of stored models, the objective is to
find a model $M^i \in \mathcal{M}$ such that $P = \sum_j \alpha_j M^i_j$ for some constants $\alpha_j \in \mathbb{R}$. It has been shown
that  this  scheme  accurately  predicts  the  appearance  of  rigid  objects  under  weak  perspective 
projection  (orthographic  projection  and  scale).  The  limitations  of  this  projection  model  are 
discussed  later  in  this  paper. 

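More concretely, let $p_i = (x_i, y_i, z_i)$, $1 \le i \le n$, be a set of $n$ object points. Under weak
perspective projection, the positions $p'_i = (x'_i, y'_i)$ of these points in the image are given by

$$\begin{aligned}
x'_i &= s r_{11} x_i + s r_{12} y_i + s r_{13} z_i + t_x \\
y'_i &= s r_{21} x_i + s r_{22} y_i + s r_{23} z_i + t_y
\end{aligned} \tag{1}$$

where $r_{ij}$ are the components of a $3 \times 3$ rotation matrix, and $s$ is a scale factor. Rewriting this
in vector equation form we obtain

$$\begin{aligned}
\mathbf{x}' &= s r_{11} \mathbf{x} + s r_{12} \mathbf{y} + s r_{13} \mathbf{z} + t_x \mathbf{1} \\
\mathbf{y}' &= s r_{21} \mathbf{x} + s r_{22} \mathbf{y} + s r_{23} \mathbf{z} + t_y \mathbf{1}
\end{aligned} \tag{2}$$

where $\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{x}', \mathbf{y}' \in \mathbb{R}^n$ are the vectors of $x_i$, $y_i$, $z_i$, $x'_i$ and $y'_i$ coordinates respectively, and
$\mathbf{1} = (1, 1, \ldots, 1)$. Consequently,

$$\mathbf{x}', \mathbf{y}' \in \mathrm{span}\{\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{1}\} \tag{3}$$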
or, in other words, $\mathbf{x}'$ and $\mathbf{y}'$ belong to a four-dimensional linear subspace of $\mathbb{R}^n$. (Notice that
$\mathbf{z}'$, the vector of depth coordinates of the projected points, also belongs to this subspace. This
fact is used in Section 4 below.) A four-dimensional space is spanned by any four linearly
independent vectors of the space. Two views of the scene supply four such vectors [13, 17].
Denote by $\mathbf{x}_1, \mathbf{y}_1$ and $\mathbf{x}_2, \mathbf{y}_2$ the location vectors of the $n$ points in the two images; then there
exist coefficients $a_1, a_2, a_3, a_4$ and $b_1, b_2, b_3, b_4$ such that

$$\begin{aligned}
\mathbf{x}' &= a_1 \mathbf{x}_1 + a_2 \mathbf{y}_1 + a_3 \mathbf{x}_2 + a_4 \mathbf{1} \\
\mathbf{y}' &= b_1 \mathbf{x}_1 + b_2 \mathbf{y}_1 + b_3 \mathbf{x}_2 + b_4 \mathbf{1}
\end{aligned} \tag{4}$$

(Note that the vector $\mathbf{y}_2$ already depends on the other four vectors.) Since $R$ is a rotation
matrix, the coefficients satisfy the following two quadratic constraints:

$$\begin{aligned}
a_1^2 + a_2^2 + a_3^2 - b_1^2 - b_2^2 - b_3^2 &= 2(b_1 b_3 - a_1 a_3) r_{11} + 2(b_2 b_3 - a_2 a_3) r_{12} \\
a_1 b_1 + a_2 b_2 + a_3 b_3 + (a_1 b_3 + a_3 b_1) r_{11} + (a_2 b_3 + a_3 b_2) r_{12} &= 0
\end{aligned} \tag{5}$$

To  derive  these  constraints  the  transformation  between  the  two  model  views  should  be  recovered. 
This  can  be  done  under  weak  perspective  using  a  third  image.  Alternatively,  the  constraints 
can  be  ignored,  in  which  case  the  system  would  confuse  rigid  transformations  with  affine  ones. 
This  usually  does  not  prevent  successful  localization  since  generally  scenes  are  fairly  different 
from  one  another. 
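The quadratic constraints can be checked numerically. The sketch below is an illustration under stated assumptions: view 1 is taken as the reference frame, view 2 is related to it by a pure rotation $R$ with unit scale (as the constraints are written), and the novel view is a rigid transformation with rotation $U$ and scale $s_n$. The helper names `rot_x`, `rot_y` and `lc_coefficients` are hypothetical.

```python
import numpy as np

def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_x(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def lc_coefficients(R, U, s_n):
    """Coefficients of x1, y1, x2 in Eq. (4) for a novel view (rotation U, scale s_n).

    Assumes x1 = x, y1 = y and x2 = r11*x + r12*y + r13*z with r13 != 0.
    """
    rows = []
    for u in (U[0], U[1]):              # rows of U generating x' and y'
        c3 = s_n * u[2] / R[0, 2]       # match the z component
        c1 = s_n * u[0] - c3 * R[0, 0]  # match the x component
        c2 = s_n * u[1] - c3 * R[0, 1]  # match the y component
        rows.append((c1, c2, c3))
    return rows

R = rot_y(0.4) @ rot_x(0.2)             # transformation between the model views
U = rot_x(-0.3) @ rot_y(0.7)            # rotation of the novel view
(a1, a2, a3), (b1, b2, b3) = lc_coefficients(R, U, s_n=1.3)
r11, r12 = R[0, 0], R[0, 1]

lhs1 = a1**2 + a2**2 + a3**2 - b1**2 - b2**2 - b3**2
rhs1 = 2 * (b1 * b3 - a1 * a3) * r11 + 2 * (b2 * b3 - a2 * a3) * r12
lhs2 = (a1 * b1 + a2 * b2 + a3 * b3
        + (a1 * b3 + a3 * b1) * r11 + (a2 * b3 + a3 * b2) * r12)
```

Under these assumptions `lhs1` equals `rhs1` and `lhs2` vanishes up to rounding. For coefficients that come from a general affine map both equalities would typically fail, which is the distinction the constraints are meant to enforce.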

An LC scheme for the problem of localization is as follows: The environment is modeled
by  a  set  of  images  with  correspondence  between  the  images.  For  example,  a  spot  can  be 
modeled  by  two  of  its  corresponding  views.  The  corresponding  quadratic  constraints  may  also 
be  stored.  Localization  is  achieved  by  recovering  the  linear  combination  that  aligns  the  model 
to  the  observed  image.  The  coefficients  are  determined  using  four  model  points  and  their 
corresponding  image  points  by  solving  a  linear  set  of  equations.  Three  points  are  sufficient  to 
determine  the  coefficients  if  the  quadratic  constraints  are  also  considered.  Additional  points 
may  be  used  to  reduce  the  effect  of  noise. 
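A minimal sketch of this alignment step, assuming synthetic feature points with known correspondence and weak perspective views: the coefficients of Eq. (4) are recovered by linear least squares, and the predicted view is compared with the observed one. The names `weak_perspective` and `align`, and all numeric values, are illustrative assumptions.

```python
import numpy as np

def rot_y(angle):
    """Rotation about the vertical (Y) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def weak_perspective(points, R, scale, t):
    """Eq. (1): rotate n x 3 points, keep x and y, scale, then translate."""
    return scale * (points @ R[:2].T) + np.asarray(t)

def align(view1, view2, coords):
    """Least-squares coefficients of [x1, y1, x2, 1] for one coordinate vector."""
    P = np.column_stack([view1[:, 0], view1[:, 1], view2[:, 0], np.ones(len(view1))])
    coeffs, *_ = np.linalg.lstsq(P, coords, rcond=None)
    return coeffs, P @ coeffs

rng = np.random.default_rng(0)
scene = rng.uniform(-1.0, 1.0, size=(8, 3))           # hypothetical feature points
view1 = weak_perspective(scene, np.eye(3), 1.0, (0.0, 0.0))
view2 = weak_perspective(scene, rot_y(0.3), 1.1, (0.2, 0.0))
novel = weak_perspective(scene, rot_y(-0.2), 0.9, (-0.1, 0.3))

a, x_pred = align(view1, view2, novel[:, 0])          # coefficients a1..a4
b, y_pred = align(view1, view2, novel[:, 1])          # coefficients b1..b4
residual = max(np.abs(x_pred - novel[:, 0]).max(),
               np.abs(y_pred - novel[:, 1]).max())
```

With exact weak perspective data the residual is at the level of rounding error. Using more than four points, as here, turns the linear system into a least-squares problem, which is one way to reduce the effect of noise.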

The LC scheme uses viewer-centered models, that is, representations that are composed
of  images.  It  has  a  number  of  advantages  over  methods  that  build  full  three-dimensional 
models  to  represent  the  scene.  First,  by  using  viewer-centered  models  that  cover  relatively  small 
transformations  we  avoid  the  need  to  handle  occlusions  in  the  scene.  If  from  some  viewpoints 
the  scene  appears  different  because  of  occlusions  we  utilize  a  new  model  for  these  viewpoints. 
Second,  viewer-centered  models  are  easier  to  build  and  to  maintain  than  object-centered  ones. 
The  models  contain  only  images  and  correspondences.  By  limiting  the  transformation  between 
the  model  images  one  can  find  the  correspondence  using  motion  methods.  If  large  portions  of 
the  environment  are  changed  between  visits  a  new  model  can  be  constructed  by  simply  replacing 
old  images  with  new  ones. 

One problem with using the LC scheme for localization is due to the weak perspective approximation.
In contrast with the problem of object recognition, where we can generally assume
that objects are small relative to their distance from the camera, in localization the environment
surrounds the robot and perspective distortions cannot be neglected. The limitations
of weak perspective modeling are discussed both mathematically and empirically in the next
two  sections.  It  is  shown  that  in  many  practical  cases  weak  perspective  is  sufficient  to  enable 
accurate  localization.  The  main  reason  is  that  the  problem  of  localization  does  not  require 
accurate  measurements  in  the  entire  image;  it  only  requires  identifying  a  sufficient  number  of 
spots  to  guarantee  accurate  naming.  If  these  spots  are  relatively  close  to  the  center  of  the 
image, or if the depth differences they create are relatively small (as in the case of looking at
a  wall  when  the  line  of  sight  is  nearly  perpendicular  to  the  wall),  the  perspective  distortions 
are  relatively  small,  and  the  system  can  identify  the  scene  with  high  accuracy.  Also,  views 
related  by  a  translation  parallel  to  the  image  plane  form  a  linear  space  even  when  perspective 
distortions  are  large.  This  case  and  other  simplifications  are  discussed  in  Section  6. 

By  using  weak  perspective  we  avoid  stability  problems  that  frequently  occur  in  perspective 
computations.  We  can  therefore  compute  the  alignment  coefficients  by  looking  at  a  relatively 
narrow  field  of  view.  The  entire  scheme  can  be  viewed  as  an  accumulative  process.  Rather  than 
acquiring  images  of  the  entire  scene  and  comparing  them  all  to  a  full  scene  model  (as  in  [2]) 
we  recognize  the  scene  image  by  image,  spot  by  spot,  until  we  accumulate  sufficient  convincing 
information  that  indicates  the  identity  of  the  place. 

When  perspective  distortions  are  relatively  large  and  weak  perspective  is  insufficient  to 
model  the  environment,  two  approaches  can  be  used.  One  possibility  is  to  construct  a  larger 
number  of  models  so  as  to  keep  the  possible  changes  between  the  familiar  and  the  novel  views 
small. Alternatively, an iterative computation can be applied to compensate for these distortions.
Such an iterative method is described in Section 4.


3.2  Positioning 

Positioning  is  the  problem  of  recovering  the  exact  position  of  the  robot.  This  position  can  be 
specified  in  a  fixed  coordinate  system  associated  with  the  environment  (i.e.,  room  coordinates), 
or  it  can  be  associated  with  some  model,  in  which  case  location  is  expressed  with  respect  to  the 
position  from  which  the  model  views  were  acquired.  In  this  section  we  discuss  an  application 
of  the  LC  scheme  to  the  positioning  problem. 

The idea is the following. We assume a model composed of two images, $P_1$ and $P_2$; their
relative position is given. Given a novel image $P'$, we first align the model with the image
(i.e., localization). By considering the coefficients of the linear combination, the robot's position
relative to the model images is recovered. To recover the absolute position of the robot in the
room, the absolute positions of the model views should also be provided.

Assuming $P_2$ is obtained from $P_1$ by a rotation $R$, translation $t = (t_x, t_y)$, and scaling $s$, the
coordinates of a point in $P'$, $(x', y')$, can be written as linear combinations of the corresponding
model points in the following way:

$$\begin{aligned}
x' &= a_1 x_1 + a_2 y_1 + a_3 x_2 + a_4 \\
y' &= b_1 x_1 + b_2 y_1 + b_3 x_2 + b_4
\end{aligned} \tag{6}$$

Substituting for $x_2$ we obtain

$$\begin{aligned}
x' &= a_1 x_1 + a_2 y_1 + a_3 (s r_{11} x_1 + s r_{12} y_1 + s r_{13} z_1 + t_x) + a_4 \\
y' &= b_1 x_1 + b_2 y_1 + b_3 (s r_{11} x_1 + s r_{12} y_1 + s r_{13} z_1 + t_x) + b_4
\end{aligned} \tag{7}$$

and rearranging these equations we obtain

$$\begin{aligned}
x' &= (a_1 + a_3 s r_{11}) x_1 + (a_2 + a_3 s r_{12}) y_1 + (a_3 s r_{13}) z_1 + (a_3 t_x + a_4) \\
y' &= (b_1 + b_3 s r_{11}) x_1 + (b_2 + b_3 s r_{12}) y_1 + (b_3 s r_{13}) z_1 + (b_3 t_x + b_4)
\end{aligned} \tag{8}$$

Using these equations we can derive all the parameters of the transformation between the model
and the image. Assume the image is obtained by a rotation $U$, translation $t_n$, and scaling $s_n$.
Using the orthonormality constraint we can first derive the scale factor

$$\begin{aligned}
s_n^2 &= (a_1 + a_3 s r_{11})^2 + (a_2 + a_3 s r_{12})^2 + (a_3 s r_{13})^2 \\
      &= a_1^2 + a_2^2 + a_3^2 s^2 + 2 a_3 s (a_1 r_{11} + a_2 r_{12})
\end{aligned} \tag{9}$$

From Equations (8) and (9), by deriving the components of the translation vector $t_n$, we can
obtain the position of the robot in the image relative to its position in the model views:

$$\begin{aligned}
\Delta x &= a_3 t_x + a_4 \\
\Delta y &= b_3 t_x + b_4
\end{aligned} \tag{10}$$


Note that $\Delta z$ is derived from the change in scale of the object. The rotation matrix $U$ between
$P_1$ and $P'$ is given by

$$\begin{aligned}
u_{11} &= \frac{a_1 + a_3 s r_{11}}{s_n}, & u_{12} &= \frac{a_2 + a_3 s r_{12}}{s_n}, & u_{13} &= \frac{a_3 s r_{13}}{s_n}, \\
u_{21} &= \frac{b_1 + b_3 s r_{11}}{s_n}, & u_{22} &= \frac{b_2 + b_3 s r_{12}}{s_n}, & u_{23} &= \frac{b_3 s r_{13}}{s_n}
\end{aligned} \tag{11}$$

As was already mentioned, the position of the robot is computed here relative to the position of
the camera when the first model image, $P_1$, was acquired. $\Delta x$ and $\Delta z$ represent the motion of
the robot from $P_1$ to $P'$, and the rest of the parameters represent its 3D rotation and elevation.
To obtain the relative position the transformation parameters between the model views, $P_1$ and
$P_2$, are required.
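The recovery of scale, rotation and image-plane shift from the alignment coefficients can be sketched as follows. This is an illustration only: it assumes that view 1 is the reference frame, that the transformation $(R, s, t_x)$ between the model views is known, and that the data are exact weak perspective views. The function name `recover_pose` and the numeric values are hypothetical, and the third row of $U$ is completed by a cross product, a step added here for convenience.

```python
import numpy as np

def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def recover_pose(a, b, R, s, t_x):
    """Scale, rotation and image-plane shift of a novel view from its LC coefficients.

    a, b      : coefficients (a1..a4), (b1..b4) of [x1, y1, x2, 1], Eq. (6).
    R, s, t_x : rotation, scale and x-translation taking view 1 to view 2.
    """
    vx = np.array([a[0], a[1], 0.0]) + a[2] * s * R[0]   # x' row of Eq. (8)
    vy = np.array([b[0], b[1], 0.0]) + b[2] * s * R[0]   # y' row of Eq. (8)
    s_n = np.linalg.norm(vx)                              # Eq. (9)
    U = np.vstack([vx, vy, np.cross(vx, vy) / s_n]) / s_n
    dx = a[2] * t_x + a[3]                                # Eq. (10)
    dy = b[2] * t_x + b[3]
    return s_n, U, dx, dy

rng = np.random.default_rng(1)
scene = rng.uniform(-1.0, 1.0, size=(10, 3))             # hypothetical feature points
R, s, t_x = rot_y(0.3), 1.1, 0.2                          # view 1 -> view 2
x1, y1 = scene[:, 0], scene[:, 1]
x2 = s * (scene @ R[0]) + t_x

U_true, s_true, t_true = rot_y(-0.25), 0.8, np.array([0.4, -0.1])
image = s_true * (scene @ U_true[:2].T) + t_true          # novel view

P = np.column_stack([x1, y1, x2, np.ones(len(x1))])
a = np.linalg.lstsq(P, image[:, 0], rcond=None)[0]
b = np.linalg.lstsq(P, image[:, 1], rcond=None)[0]
s_n, U, dx, dy = recover_pose(a, b, R, s, t_x)
```

On this synthetic input the recovered scale, rotation and shift agree with the values used to generate the novel view, up to rounding.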


3.3  Repositioning 

An interesting variant of the positioning problem, referred to as repositioning, is defined as
follows. Given an image, called the target image, position yourself in the location from which
this image was observed.¹ One way to solve this problem is to extract the exact position from
which  the  target  image  was  obtained  and  direct  the  robot  to  that  position.  In  this  section  we 
are  interested  in  a  more  qualitative  approach.  Under  this  approach  position  is  not  computed. 
Instead,  the  robot  observes  the  environment  and  extracts  only  the  direction  to  the  target 
location.  Unlike  the  exact  approach,  the  method  presented  here  does  not  require  the  recovery 
of  the  transformation  between  the  model  views. 

We assume we are given a model of the environment together with a target image.
The  robot  is  allowed  to  take  new  images  as  it  is  moving  towards  the  target.  We  assume  a 
horizontally  moving  platform.  (In  other  words,  we  assume  three  degrees  of  freedom  rather  than 
six;  the  robot  is  allowed  to  rotate  around  the  vertical  axis  and  translate  horizontally.  The 
validity  of  this  constraint  is  discussed  in  Section  6.)  Below  we  give  a  simple  computation  that 
determines  a  path  which  terminates  in  the  target  location.  At  each  time  step  the  robot  acquires 
a  new  image  and  aligns  it  with  the  model.  By  comparing  the  alignment  coefficients  with  the 
coefficients  for  the  target  image  the  robot  determines  its  next  step.  The  algorithm  is  divided 
into  two  stages.  In  the  first  stage  the  robot  fixates  on  one  identifiable  point  and  moves  along 
a  circular  path  around  the  fixation  point  until  the  line  of  sight  to  this  point  coincides  with 
the  line  of  sight  to  the  corresponding  point  in  the  target  image.  In  the  second  stage  the  robot 
advances  forward  or  retreats  backward  until  it  reaches  the  target  location. 

Given a model composed of two images, $P_1$ and $P_2$, $P_2$ is obtained from $P_1$ by a rotation
about the Y-axis by an angle $\alpha$, horizontal translation $t_x$, and scale factor $s$. Given a target
image $P_t$, $P_t$ is obtained from $P_1$ by a similar rotation by an angle $\theta$, translation $t_t$, and scale
$s_t$. Using Eq. (4) the position of a target point $(x_t, y_t)$ can be expressed as

$$\begin{aligned}
x_t &= a_1 x_1 + a_3 x_2 + a_4 \\
y_t &= b_2 y_1
\end{aligned} \tag{12}$$



(The rest of the coefficients are zero since the platform moves horizontally.) In fact, the coefficients
are given by

$$\begin{aligned}
a_1 &= \frac{s_t \sin(\alpha - \theta)}{\sin \alpha} \\
a_3 &= \frac{s_t \sin \theta}{s \sin \alpha} \\
a_4 &= t_t - \frac{t_x s_t \sin \theta}{s \sin \alpha} \\
b_2 &= s_t
\end{aligned} \tag{13}$$



(The  derivation  is  given  in  the  Appendix.) 
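The closed-form coefficients can be sanity-checked on a single point. The sketch below assumes the convention that a rotation by an angle about the vertical axis maps the horizontal coordinate to scale * (x cos(angle) + z sin(angle)) + shift; this sign convention, the function names, and the numeric values are assumptions for illustration.

```python
import math

def view_x(x, z, angle, scale, shift):
    """Horizontal image coordinate after a rotation about the vertical axis."""
    return scale * (x * math.cos(angle) + z * math.sin(angle)) + shift

def target_coefficients(alpha, theta, s, s_t, t_x, t_t):
    """Coefficients a1, a3, a4, b2 of Eq. (13)."""
    a1 = s_t * math.sin(alpha - theta) / math.sin(alpha)
    a3 = s_t * math.sin(theta) / (s * math.sin(alpha))
    a4 = t_t - t_x * s_t * math.sin(theta) / (s * math.sin(alpha))
    return a1, a3, a4, s_t

alpha, s, t_x = 0.3, 1.1, 0.2        # view 1 -> view 2
theta, s_t, t_t = 0.5, 0.9, -0.4     # view 1 -> target view
x, y, z = 0.7, -0.3, 2.0             # one scene point in the frame of view 1

x1, y1 = x, y
x2 = view_x(x, z, alpha, s, t_x)
a1, a3, a4, b2 = target_coefficients(alpha, theta, s, s_t, t_x, t_t)

x_t_predicted = a1 * x1 + a3 * x2 + a4          # Eq. (12)
x_t_direct = view_x(x, z, theta, s_t, t_t)
y_t_predicted = b2 * y1
```

Both routes give the same horizontal coordinate, and the vertical coordinate is simply scaled by $s_t$, as expected for a platform that moves horizontally.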

At every time step the robot acquires an image and aligns it with the above model. Assume
that image $P_p$ is obtained as a result of a rotation by an angle $\phi$, translation $t_p$, and scale $s_p$.

¹ This problem can be considered as a variant of the homing problem. A discussion of the general homing
problem with a “signature-based” solution can be found in [11].
The position of a point $(x_p, y_p)$ is expressed by

$$\begin{aligned}
x_p &= c_1 x_1 + c_3 x_2 + c_4 \\
y_p &= d_2 y_1
\end{aligned} \tag{14}$$

where the coefficients are given by

$$\begin{aligned}
c_1 &= \frac{s_p \sin(\alpha - \phi)}{\sin \alpha} \\
c_3 &= \frac{s_p \sin \phi}{s \sin \alpha} \\
c_4 &= t_p - \frac{t_x s_p \sin \phi}{s \sin \alpha} \\
d_2 &= s_p
\end{aligned} \tag{15}$$

The step performed by the robot is determined by

$$\delta = \frac{a_1}{a_3} - \frac{c_1}{c_3} \tag{16}$$

That is,

$$\delta = \frac{s \sin(\alpha - \theta)}{\sin \theta} - \frac{s \sin(\alpha - \phi)}{\sin \phi} = s \sin \alpha \, (\cot \theta - \cot \phi) \tag{17}$$

The robot should now move so as to reduce the absolute value of $\delta$. The direction of motion
depends on the sign of $\alpha$. The robot can deduce the direction by moving slightly to the side
and checking if this motion results in an increase or decrease of $|\delta|$. The motion is defined as
follows. The robot moves to the right (or to the left, depending on which direction reduces $|\delta|$)
by a step $\Delta x$.
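The first stage can be illustrated with a small simulation of the step measure. This is a sketch under simplifying assumptions: the robot's state is reduced to its viewing angle, the measure is computed directly from the closed-form coefficient ratios rather than from aligned images, and each motion changes the angle by a fixed amount. The names `lc_ratio` and `step_measure` and the numeric values are hypothetical.

```python
import math

def lc_ratio(angle, alpha, s):
    """Ratio a1/a3 (or c1/c3) for a view rotated by `angle`; that view's scale cancels."""
    return s * math.sin(alpha - angle) / math.sin(angle)

def step_measure(theta, phi, alpha, s):
    """Difference between the target ratio and the current ratio."""
    return lc_ratio(theta, alpha, s) - lc_ratio(phi, alpha, s)

alpha, s = 0.3, 1.0           # relation between the two model views
theta, phi = 0.5, 0.9         # target and initial viewing angles
step = 0.01                   # change in angle per motion step (assumed)

closed_form = s * math.sin(alpha) * (1 / math.tan(theta) - 1 / math.tan(phi))
initial = step_measure(theta, phi, alpha, s)

for _ in range(200):
    here = abs(step_measure(theta, phi, alpha, s))
    left = abs(step_measure(theta, phi - step, alpha, s))
    right = abs(step_measure(theta, phi + step, alpha, s))
    if here <= min(left, right):
        break                 # neither side reduces the measure any further
    phi = phi - step if left < right else phi + step
```

Probing both sides mirrors the idea of moving slightly and checking whether the measure increases or decreases. Because the cotangent is monotonic on $(0, \pi)$, this greedy rule approaches the target angle without needing the sign of $\alpha$ explicitly.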

A new image $P_n$ is now acquired, and the fixated point is located in this image. Denote
its new position by $x_n$. Since the motion is parallel to the image plane the depth values of the
point in the two views, $P_p$ and $P_n$, are identical. We now want to rotate the camera so as to
return the fixated point to its original position. The angle of rotation, $\beta$, can be deduced from
the equation

$$x_p = x_n \cos \beta + \sin \beta \tag{18}$$

This equation has two solutions. We choose the one that counters the translation (namely, if
translation  is  to  the  right,  the  camera  should  rotate  to  the  left),  and  that  keeps  the  angle  of 
rotation  small.  In  the  next  time  step  the  new  picture  Pn  replaces  Pp  and  the  procedure  is 
repeated  until  b  vanishes.  The  resulting  path  is  circular  around  the  point  of  focus. 

Once the robot arrives at a position for which $\delta = 0$ (namely, its line of sight coincides
with that of the target image, and $\phi = \theta$) it should advance forward or retreat backward
to adjust its position along the line of sight. Several measures can be used to determine the
direction of motion; one example is the term $c_1 / a_1$, which satisfies

$$\frac{c_1}{a_1} = \frac{s_p}{s_t} \tag{19}$$

when the two lines of sight coincide. The objective at this stage is to bring this measure to 1.

4 Handling Perspective Distortions


The linear combination scheme presented above accurately handles changes in viewpoint assuming
the images are obtained under weak perspective projection. Error analysis and experimental
results demonstrate that in many practical cases this assumption is valid. In cases where perspective
distortions are too large to be handled by a weak perspective approximation, matching
between  the  model  and  the  image  can  be  facilitated  in  two  ways.  One  possibility  is  to  avoid 
cases  of  large  perspective  distortion  by  augmenting  the  library  of  stored  models  with  additional 
models.  In  a  relatively  dense  library  there  usually  exists  a  model  that  is  related  to  the  image 
by  a  sufficiently  small  transformation  avoiding  such  distortions.  The  second  alternative  is  to 
improve  the  match  between  the  model  and  the  image  using  an  iterative  process.  In  this  section 
we  consider  the  second  option. 

The suggested iterative process is based on a Taylor expansion of the perspective coordinates.
As described below, this expansion results in a polynomial consisting of terms each
of which can be approximated by linear combinations of views. The first term of this series
represents  the  orthographic  approximation.  The  process  resembles  a  method  of  matching  3D 
points  with  2D  points  described  recently  by  DeMenthon  and  Davis  [3].  In  this  case,  however, 
the  method  is  applied  to  2D  models  rather  than  3D  ones.  In  our  application  the  3D  coordinates 
of  the  model  points  are  not  provided:  instead  they  are  approximated  from  the  model  views. 

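An image point $(x, y) = (fX/Z, fY/Z)$ is the projection of some object point $(X, Y, Z)$ in the image, where $f$ denotes the focal length. Consider the following Taylor expansion of $1/Z$ around some depth value $Z_0$:

$$\frac{1}{Z} = \sum_{k=0}^{\infty} \frac{(-1)^k (Z - Z_0)^k}{Z_0^{k+1}} = \frac{1}{Z_0} \sum_{k=0}^{\infty} (-1)^k \left( \frac{Z - Z_0}{Z_0} \right)^k \qquad (20)$$

The Taylor series describing the position of a point $x$ is therefore given by

$$x = \frac{fX}{Z} = \frac{fX}{Z_0} \sum_{k=0}^{\infty} (-1)^k \left( \frac{Z - Z_0}{Z_0} \right)^k \qquad (21)$$

Notice that the zero term contains the orthographic approximation for $x$. Denote by $\Delta^{(k)}$ the $k$th term of the series and by $x^{(k)}$ the corresponding partial sum:

$$\Delta^{(k)} = \frac{fX}{Z_0} (-1)^k \left( \frac{Z - Z_0}{Z_0} \right)^k, \qquad x^{(k)} = \sum_{i=0}^{k} \Delta^{(i)} \qquad (22)$$

A recursive definition of the above series is given below.

Initialization:

$$x^{(0)} = \Delta^{(0)} = \frac{fX}{Z_0}$$

Iterative step:

$$\Delta^{(k)} = -\frac{Z - Z_0}{Z_0} \, \Delta^{(k-1)}$$
$$x^{(k)} = x^{(k-1)} + \Delta^{(k)}$$

where $x^{(k)}$ represents the $k$th order approximation for $x$, and $\Delta^{(k)}$ represents the highest order term in $x^{(k)}$.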
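The recursion can be checked numerically for a single point. The sketch below assumes the geometric-series form of the expansion of $1/Z$ (each term is the previous one multiplied by $-(Z - Z_0)/Z_0$); the function name and the sample values are illustrative choices, not taken from the paper.

```python
def perspective_series(f, X, Z, Z0, order):
    """Partial sums x^(0), ..., x^(order) of the series for x = f*X/Z around Z0."""
    delta = f * X / Z0            # zeroth term: the orthographic estimate
    x_k = delta
    partial_sums = [x_k]
    ratio = -(Z - Z0) / Z0        # factor relating consecutive terms
    for _ in range(order):
        delta *= ratio            # Delta^(k) from Delta^(k-1)
        x_k += delta              # x^(k) = x^(k-1) + Delta^(k)
        partial_sums.append(x_k)
    return partial_sums


# Hypothetical values: focal length 700 pixels, true depth 5, expansion depth 4.
sums = perspective_series(f=700.0, X=1.0, Z=5.0, Z0=4.0, order=6)
exact = 700.0 * 1.0 / 5.0
errors = [abs(s - exact) for s in sums]
for k, (s, e) in enumerate(zip(sums, errors)):
    print(k, round(s, 4), round(e, 4))
```

The series converges only when $|Z - Z_0| < Z_0$, and the error shrinks by the factor $|Z - Z_0| / Z_0$ at every step, so a depth estimate close to the true depth needs very few terms.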


According to the orthographic approximation both $X$ and $Z$ can be expressed as linear combinations of the model views (Eq. (4)). We therefore apply the above procedure, approximating $X$ and $Z$ at every step using the linear combination that best aligns the model points with the image points. The general idea is therefore the following. First, we estimate $x^{(0)}$ and $\Delta^{(0)}$ by solving the orthographic case. Then at each step of the iteration we improve the estimate by seeking the linear combination that best estimates the factor

$$-\frac{Z - Z_0}{Z_0} \approx \frac{x - x^{(k-1)}}{\Delta^{(k-1)}} \qquad (23)$$


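Denote by $\mathbf{x} \in \mathcal{R}^n$ the vector of image point coordinates, and denote by

$$P = [\mathbf{x}_1, \mathbf{y}_1, \mathbf{x}_2, \mathbf{1}] \qquad (24)$$

an $n \times 4$ matrix containing the position of the points in the two model images. Denote by $P^+ = (P^T P)^{-1} P^T$ the pseudo-inverse of $P$ (we assume $P$ is overdetermined). Also denote by $\mathbf{a}^{(k)}$ the coefficients computed for the $k$th step. $P\mathbf{a}^{(k)}$ represents the linear combination computed at that step to approximate the $X$ or the $Z$ values. Since $Z_0$ and $f$ are constant at every step, they can be merged into the linear combination. Denote by $\mathbf{x}^{(k)}$ and $\Delta^{(k)}$ the vectors of computed values of $x$ and $\Delta$ at the $k$th step. An iterative procedure to align a model to the image is described below.

Initialization:

Solve the orthographic approximation, namely

$$\mathbf{a}^{(0)} = P^+ \mathbf{x}$$
$$\mathbf{x}^{(0)} = \Delta^{(0)} = P \mathbf{a}^{(0)}$$

Iterative step:

$$\mathbf{q}^{(k)} = (\mathbf{x} - \mathbf{x}^{(k-1)}) \div \Delta^{(k-1)}$$
$$\mathbf{a}^{(k)} = P^+ \mathbf{q}^{(k)}$$
$$\Delta^{(k)} = (P \mathbf{a}^{(k)}) \otimes \Delta^{(k-1)}$$
$$\mathbf{x}^{(k)} = \mathbf{x}^{(k-1)} + \Delta^{(k)}$$

where the vector operations $\otimes$ and $\div$ are defined elementwise as

$$\mathbf{u} \otimes \mathbf{v} = (u_1 v_1, \ldots, u_n v_n)$$
$$\mathbf{u} \div \mathbf{v} = (u_1 / v_1, \ldots, u_n / v_n)$$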
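The procedure above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the pseudo-inverse is applied through the normal equations, the names `solve_least_squares`, `combine`, and `align` are hypothetical, and the `eps` guard (which skips the elementwise division where an entry of the previous correction is numerically zero) is an added assumption that the text does not discuss.

```python
def solve_least_squares(P, q):
    """Return a minimizing ||P a - q|| via the normal equations (P^T P) a = P^T q."""
    n, m = len(P), len(P[0])
    # Augmented normal-equation matrix [P^T P | P^T q].
    A = [[sum(P[r][i] * P[r][j] for r in range(n)) for j in range(m)]
         + [sum(P[r][i] * q[r] for r in range(n))]
         for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    return [A[i][m] / A[i][i] for i in range(m)]


def combine(P, a):
    """Evaluate the linear combination P a, one value per point."""
    return [sum(p * c for p, c in zip(row, a)) for row in P]


def align(P, x, iterations, eps=1e-9):
    """Initialization plus `iterations` refinement steps; returns x^(k)."""
    a = solve_least_squares(P, x)          # a^(0) = P+ x
    delta = combine(P, a)                  # Delta^(0) = P a^(0)
    x_k = list(delta)                      # x^(0) = Delta^(0)
    for _ in range(iterations):
        # q^(k): elementwise quotient, guarded against vanishing corrections.
        q = [(xi - xk) / d if abs(d) > eps else 0.0
             for xi, xk, d in zip(x, x_k, delta)]
        a = solve_least_squares(P, q)      # a^(k) = P+ q^(k)
        delta = [c * d for c, d in zip(combine(P, a), delta)]
        x_k = [xk + d for xk, d in zip(x_k, delta)]
    return x_k


# Hypothetical model matrix with columns x1, y1, x2, 1 for six points.
P = [[0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 3.0, 1.0], [0.0, 1.0, 2.0, 1.0],
     [1.0, 1.0, 5.0, 1.0], [2.0, 1.0, 1.0, 1.0], [1.0, 2.0, 4.0, 1.0]]
# An image that is an exact linear combination (the orthographic case).
x_image = [3.5, 6.5, 3.0, 6.5, 6.5, 5.0]
coeffs = solve_least_squares(P, x_image)
aligned = align(P, x_image, iterations=3)
```

In this usage example the image lies exactly in the span of the model columns, so the initialization already reproduces it and the later steps leave it unchanged. For a perspective image the later steps add correction terms whose size depends on the depth variation in the scene.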


5  Projection  Model  -  Error  Analysis 

In  this  section  we  estimate  the  error  obtained  by  using  the  linear  combination  method.  The 
method  assumes  a  weak  perspective  projection  model.  We  compare  this  assumption  with  the 
more  accurate  perspective  projection  model. 

A point $(X, Y, Z)$ is projected under the perspective model to $(x, y) = (fX/Z, fY/Z)$ in the image, where $f$ denotes the focal length. Under our weak perspective model the same point is approximated by $(\hat{x}, \hat{y}) = (sX, sY)$, where $s$ is a scaling factor. The best estimate for $s$ is given by $s = f/Z_0$, where $Z_0$ is the average depth of the observed environment. Denote the error by

$$E = |\hat{x} - x| \qquad (25)$$

The error is expressed by

$$E = \left| fX \left( \frac{1}{Z_0} - \frac{1}{Z} \right) \right| \qquad (26)$$

Changing to image coordinates

$$E = \left| xZ \left( \frac{1}{Z_0} - \frac{1}{Z} \right) \right| \qquad (27)$$

or

$$E = |x| \left| \frac{Z}{Z_0} - 1 \right| \qquad (28)$$

The error is small when the measured feature is close to the optical axis, or when our estimate for the depth, $Z_0$, is close to the real depth, $Z$. This supports the basic intuition that for images with low depth variance and for fixated regions (regions near the center of the image), the obtained perspective distortions are relatively small, and the system can therefore identify the scene with high accuracy. Figures 1 and 2 show the depth ratio $Z/Z_0$ as a function of $x$ for $\epsilon = 10$ and 20 pixels, and Table 1 shows a number of examples for this function. The allowed depth variance, $Z/Z_0$, is computed as a function of $x$ and the tolerated error, $\epsilon$. For example, a 10 pixel error tolerated in a field of view of up to ±50 pixels is equivalent to allowing depth variations of 20%. From this discussion it is apparent that when a model is aligned to the image the results of this alignment should be judged differently at different points of the image. The farther away a point is from the center the more discrepancy should be tolerated between the prediction and the actual image. A five pixel error at position $x = 50$ is equivalent to a 10 pixel error at position $x = 100$.
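The depth ratios quoted above follow directly from the error expression. With $x = fX/Z$ and $\hat{x} = fX/Z_0$ the error is $|x| \, |Z/Z_0 - 1|$, so a tolerance of $\epsilon$ pixels at image position $x$ allows $Z/Z_0$ up to $1 + \epsilon/|x|$. The short sketch below (the function name is an illustrative choice) tabulates this bound for the positions and tolerances used in Table 1.

```python
def allowed_depth_ratio(x, eps):
    """Largest Z/Z0 for which the weak-perspective error |x| * |Z/Z0 - 1| stays within eps."""
    return 1.0 + eps / abs(x)


# Rows: image position x in pixels; columns: tolerated error in pixels.
for x in (25, 50, 75, 100):
    print(x, [round(allowed_depth_ratio(x, eps), 2) for eps in (5, 10, 15, 20)])
```

For instance, `allowed_depth_ratio(50, 10)` gives 1.2, which is the 20% depth variation mentioned in the text.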

Figure 1: $Z/Z_0$ as a function of $x$ for $\epsilon = 10$ pixels.

Figure 2: $Z/Z_0$ as a function of $x$ for $\epsilon = 20$ pixels.

x \ ε      5       10      15      20
25         1.2     1.4     1.6     1.8
50         1.1     1.2     1.3     1.4
75         1.07    1.13    1.2     1.27
100        1.05    1.1     1.15    1.2

Table 1: Allowed depth ratios, $Z/Z_0$, as a function of $x$ (half the width of the field considered) and the error allowed ($\epsilon$, in pixels).

So far we have considered the discrepancies between the weak perspective and the perspective projections of points. The accuracy of the LC scheme depends on the validity of the weak perspective projection both in the model views and for the incoming image. In the rest of this section we develop an error term for the LC scheme assuming that both the model views and the incoming image are obtained by perspective projection.

The error obtained by using the LC scheme is given by

$$E = |x - a x_1 - b y_1 - c x_2 - d| \qquad (29)$$

Since the scheme accurately predicts the appearances of points under weak perspective projection, it satisfies

$$\hat{x} = a \hat{x}_1 + b \hat{y}_1 + c \hat{x}_2 + d \qquad (30)$$

where accented letters represent orthographic approximations. Assume that in the two model pictures the depth ratios are roughly equal:

$$\frac{Z_0^M}{Z^M} \approx \frac{Z_{01}}{Z_1} \approx \frac{Z_{02}}{Z_2} \qquad (31)$$

(This condition is satisfied, for example, when between the two model images the camera only translates along the image plane.) Using the fact that

$$x = \frac{fX}{Z} = \frac{fX}{Z_0} \frac{Z_0}{Z} = \hat{x} \frac{Z_0}{Z} \qquad (32)$$

we obtain

$$E = |x - a x_1 - b y_1 - c x_2 - d|$$
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The  error  therefore  depends  on  two  terms.  The  first  gets  smaller  as  the  image  points  get  closer 
to  the  center  of  the  frame  and  as  the  difference  between  the  depth  ratios  of  the  model  and  the 
image  gets  smaller.  The  second  gets  smaller  as  the  translation  component  gets  smaller  and  as 
the  model  gets  close  to  orthographic. 
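The two terms can be made explicit with a short derivation. This is a reconstruction under the assumptions stated in this section, not necessarily the authors' own equation. Write $\rho = Z_0 / Z$ for the depth ratio of a point in the incoming image and $\rho_M$ for the (assumed common) depth ratio of the same point in the two model views, so that $x = \rho \hat{x}$, $x_1 = \rho_M \hat{x}_1$, $y_1 = \rho_M \hat{y}_1$, and $x_2 = \rho_M \hat{x}_2$. Using $\hat{x} = a \hat{x}_1 + b \hat{y}_1 + c \hat{x}_2 + d$:

```latex
\begin{align*}
E &= |x - a x_1 - b y_1 - c x_2 - d| \\
  &= |\rho \hat{x} - \rho_M (a \hat{x}_1 + b \hat{y}_1 + c \hat{x}_2) - d| \\
  &= |\rho \hat{x} - \rho_M (\hat{x} - d) - d| \\
  &= |\hat{x} (\rho - \rho_M) + d (\rho_M - 1)| \\
  &\le |\hat{x}| \, |\rho - \rho_M| + |d| \, |\rho_M - 1|
\end{align*}
```

The first term vanishes for points at the image center ($\hat{x} = 0$) or when the model and image depth ratios agree. The second vanishes when the translation coefficient $d$ is zero or when the model views are orthographic ($\rho_M = 1$).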

Following this analysis, weak perspective can be used as a projection model when the depth
variations  in  the  scene  are  relatively  low  and  when  the  system  concentrates  on  the  center  part 
of  the  image.  We  conclude  that,  by  fixating  on  distinguished  parts  of  the  environment,  the 
linear  combinations  scheme  can  be  used  for  localization  and  positioning. 


6  Imposing  Constraints 

Localization and positioning require a large memory and a great deal of on-line computation. A large number of models must be stored to enable the robot to navigate and manipulate in relatively large and complicated environments. The computational cost of model-image comparison is high, and if context (such as path history) is not available the number of required comparisons may get very large. To reduce this computational cost a number of constraints may be employed. These constraints take advantage of the structure of the robot, the properties of indoor environments, and the natural properties of the navigation task. This section examines some of these constraints.

One thing a system may attempt to do is to build the set of models so as to reduce the effect of perspective distortions in order to avoid performing iterative computations. Views of the environment obtained when the system looks relatively deep into the scene usually satisfy this condition. When perspective distortions are large the system may consider modeling subsets of views related by a translation parallel to the image plane (perpendicular to the line of sight). In this case the depth values of the points are roughly equal across all images considered, and it can be shown that novel views can be expressed by linear combinations of two model views even in the presence of large perspective distortions. This becomes apparent from the following derivation. Let $(X_i, Y_i, Z_i)$, $1 \le i \le n$, be a point projected in the image to $(x_i, y_i) = (fX_i/Z_i, fY_i/Z_i)$, and let $(x'_i, y'_i)$ be the projected point after applying a rigid transformation. Assuming that $Z'_i = Z_i$, and setting $f = 1$ to simplify the notation, we obtain

$$Z_i x'_i = r_{11} X_i + r_{12} Y_i + r_{13} Z_i + t_x$$
$$Z_i y'_i = r_{21} X_i + r_{22} Y_i + r_{23} Z_i + t_y \qquad (34)$$

Dividing by $Z_i$ we obtain

$$x'_i = r_{11} x_i + r_{12} y_i + r_{13} + t_x \frac{1}{Z_i}$$
$$y'_i = r_{21} x_i + r_{22} y_i + r_{23} + t_y \frac{1}{Z_i} \qquad (35)$$

Rewriting this in vector equation form gives

$$\mathbf{x}' = r_{11} \mathbf{x} + r_{12} \mathbf{y} + r_{13} \mathbf{1} + t_x \mathbf{z}^{-1}$$
$$\mathbf{y}' = r_{21} \mathbf{x} + r_{22} \mathbf{y} + r_{23} \mathbf{1} + t_y \mathbf{z}^{-1} \qquad (36)$$

where $\mathbf{x}$, $\mathbf{y}$, $\mathbf{x}'$, and $\mathbf{y}'$ are the vectors of $x_i$, $y_i$, $x'_i$, and $y'_i$ values respectively, $\mathbf{1}$ is a vector of all 1s, and $\mathbf{z}^{-1}$ is a vector of $1/Z_i$ values. Consequently, as in the weak perspective case, novel views obtained by a translation parallel to the image plane can be expressed by linear combinations of four vectors.
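A small numeric check illustrates the simplest instance of this result: a pure translation parallel to the image plane (no rotation) with focal length 1. Both restrictions, the function names, and the sample points are assumptions made for the sketch. The view after translation is compared with the combination $x + t_x / Z$, $y + t_y / Z$ built from the original view and the inverse depths.

```python
def project(point):
    """Perspective projection with focal length 1."""
    X, Y, Z = point
    return (X / Z, Y / Z)


def translated_view(point, tx, ty):
    """Project the point after translating it parallel to the image plane."""
    X, Y, Z = point
    return project((X + tx, Y + ty, Z))


def predicted_view(point, tx, ty):
    """Same view written as a combination of x, y, and 1/Z."""
    x, y = project(point)
    inv_z = 1.0 / point[2]
    return (x + tx * inv_z, y + ty * inv_z)


# Two hypothetical points at very different depths (strong perspective effects).
points = [(1.0, 2.0, 2.0), (-1.0, 0.5, 10.0)]
actual = [translated_view(p, 0.5, -0.25) for p in points]
predicted = [predicted_view(p, 0.5, -0.25) for p in points]
print(actual)
print(predicted)
```

The two lists agree even though the depths differ by a factor of five, because the depth of each point is unchanged by the motion.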

An indoor environment usually provides the robot with a flat, horizontal support. Consequently, the motion of the camera is often constrained to rotation about the vertical ($Y$) axis and to translation in the $XZ$-plane. Such motion has only three degrees of freedom instead of the six degrees of freedom in the general case. Under this constraint fewer correspondences are required to align the model with the image. For example, in Eq. (4) (above) the coefficients $a_2 = b_1 = b_3 = b_4 = 0$. Three points rather than four are required to determine the coefficients by solving a linear system. Two, rather than three, are required if the quadratic constraints are also considered. Another advantage to considering only horizontal motion is the fact that this motion constrains the possible epipolar lines between images. This fact can be used to guide the task of correspondence seeking.

Objects in indoor environments sometimes appear in roughly planar settings. In particular, the relatively static objects tend to be located along walls. Such objects include windows, shelves, pictures, closets and tables. When the assumption of orthographic projection is valid (for example, when the robot is relatively distant from the wall, or when the line of sight is roughly perpendicular to the wall) the transformation between any two views can be described by a 2D affine transformation. The dimension of the space of views of the scene is then reduced to three (rather than four), and Eq. (4) becomes

$$\mathbf{x}' = a_1 \mathbf{x}_1 + a_2 \mathbf{y}_1 + a_4 \mathbf{1}$$
$$\mathbf{y}' = b_1 \mathbf{x}_1 + b_2 \mathbf{y}_1 + b_4 \mathbf{1} \qquad (37)$$

($a_3 = b_3 = 0$.) Only one view is therefore sufficient to model the scene.

Most office-like indoor environments are composed of rooms connected by corridors. Navigating in such an environment involves maneuvering through the corridors, entering and exiting the rooms. Not all points in such an environment are equally important. Junctions, places where the robot faces a number of options for changing its direction, are more important than other places for navigation. In an indoor environment these places include the thresholds of rooms and the beginnings and ends of corridors. A navigation system would therefore tend to store more models for these points than for others.

One important property shared by many junctions is that they are confined to relatively small areas. Consider for example the threshold of a room. It is a relatively narrow place that separates the room from the adjacent corridor. When a robot is about to enter a room, a common behavior includes stepping through the door, looking into the room, and identifying it before a decision is made to enter the room or to avoid it. The set of interesting images for this task includes the set of views of the room from its entrance. Provided that thresholds are narrow these views are related to each other almost exclusively by rotation around the vertical axis. Under perspective projection, such a rotation is relatively easy to recover. The position of points in novel views can be recovered from one model view only. This is apparent from the following derivation. Consider a point $p = (X, Y, Z)$. Its position in a model view is given by $(x, y) = (fX/Z, fY/Z)$. Now, consider another view obtained by a rotation $R$ around the camera. The location of $p$ in the new view is given by (assuming $f = 1$)

$$(x', y') = \left( \frac{r_{11} X + r_{12} Y + r_{13} Z}{r_{31} X + r_{32} Y + r_{33} Z}, \; \frac{r_{21} X + r_{22} Y + r_{23} Z}{r_{31} X + r_{32} Y + r_{33} Z} \right) \qquad (38)$$

implying that

$$(x', y') = \left( \frac{r_{11} x + r_{12} y + r_{13}}{r_{31} x + r_{32} y + r_{33}}, \; \frac{r_{21} x + r_{22} y + r_{23}}{r_{31} x + r_{32} y + r_{33}} \right) \qquad (39)$$

Depth is therefore not a factor in determining the relation between the views. Eq. (39) becomes even simpler if only rotations about the $Y$-axis are considered:

$$(x', y') = \left( \frac{x \cos\alpha + \sin\alpha}{-x \sin\alpha + \cos\alpha}, \; \frac{y}{-x \sin\alpha + \cos\alpha} \right) \qquad (40)$$

where $\alpha$ is the angle of rotation. In this case $\alpha$ can be recovered merely from a single correspondence.
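One way to recover the angle from a single correspondence is sketched below; the closed form is a consequence of the rotation formula rather than something stated in the text. For a rotation about the vertical axis with $f = 1$, the new horizontal coordinate is $x' = (x \cos\alpha + \sin\alpha) / (-x \sin\alpha + \cos\alpha)$, which rearranges to $\tan\alpha = (x' - x) / (1 + x x')$, that is, $\alpha = \arctan x' - \arctan x$. The function names are illustrative, and coordinates measured in pixels would first be divided by the focal length.

```python
import math


def rotate_view(x, y, alpha):
    """Image position after rotating the camera by alpha about the vertical axis (f = 1)."""
    denom = -x * math.sin(alpha) + math.cos(alpha)
    return ((x * math.cos(alpha) + math.sin(alpha)) / denom, y / denom)


def recover_angle(x, x_new):
    """Rotation angle from the horizontal coordinates of one corresponding point."""
    return math.atan(x_new) - math.atan(x)


# Hypothetical correspondence: a point at (0.2, -0.1) and a rotation of 0.3 radians.
x_new, y_new = rotate_view(0.2, -0.1, 0.3)
alpha = recover_angle(0.2, x_new)
print(alpha)
```

The sketch assumes the point stays in front of the camera in both views (a positive denominator in `rotate_view`), which holds for the moderate rotations expected at a room threshold.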


7  Experiments 

The LC method was implemented and applied to images taken in an indoor environment. Images of two offices, A and B, that have similar structures were taken using a Panasonic camera with a focal length of 700 pixels. Semi-static objects, such as heavy furniture and pictures, were used to distinguish between the offices. Figure 3 shows two model views of office A. The views were taken at a distance of about 4m from the wall. Correspondences were picked manually. The results of aligning the model views to images of the two offices are presented in Figure 4. The left image contains an overlay of a predicted image (the thick white lines), constructed by linearly combining the two views, and an actual image of office A. A good match between the two was achieved. The right image contains an overlay of a predicted image constructed from a model of office B and an image of office A. Because the offices share a similar structure the static cues (the wall corners) were perfectly aligned. The semi-static cues, however, did not match any features in the image.

Figure 5 shows the matching of the model of office A with an image of the same office obtained by a relatively large motion forward (about 2m) and to the side (about 1.5m). Although the distances are relatively short most perspective distortions are negligible, and a good match between the model and the image is obtained.

Figure 3: Two model views of office A.

Figure 4: Matching a model of office A to an image of office A (left), and matching a model of office B to the same image (right).

Figure 5: Matching a model of office A to an image of the same office obtained by a relatively large motion forward and to the right.

Figure 6: Two model views of a corridor.

Figure 7: Matching the corridor model with two images of the corridor. The right image was obtained by a relatively large motion forward (about half of the corridor length) and to the right.

Another  set  of  images  was  taken  in  a  corridor.  Here,  because  of  the  deep  structure  of 
the  corridor,  perspective  distortions  are  noticeable.  Nevertheless,  the  alignment  results  still 
demonstrate  an  accurate  match  in  large  portions  of  the  image.  Figure  6  shows  two  model  views 
of  the  corridor.  Figure  7  (left)  shows  an  overlay  of  a  linear  combination  of  the  model  views 
with  an  image  of  the  corridor.  It  can  be  seen  that  the  parts  that  are  relatively  distant  align 
perfectly.  Figure  7  (right)  shows  the  matching  of  the  corridor  model  with  an  image  obtained  by 
a  relatively  large  motion  (about  half  of  the  corridor  length).  Because  of  perspective  distortions 
the relatively near features no longer align (e.g., the near door edges). The relatively far edges,
however,  still  match. 

The next experiment shows the application of the iterative process presented in Section 4 in cases where large perspective distortions were noticeable. Figure 8 shows two model views, and Figure 9 shows the results of matching a linear combination of the model views to an image of the same office. In this case, because the image was taken from a relatively close distance, perspective distortions cannot be neglected. The effects of perspective distortions can be noticed on the right corner of the board, and on the edges of the hanger on the top right. Perspective effects were reduced by using the iterative process. The results of applying this procedure after one and three iterations are shown in Figure 10.

The experimental results demonstrate that the LC method achieves accurate localization in
many  cases,  and  that  when  the  method  fails  because  of  large  perspective  distortions  an  iterative 
computation  can  be  used  to  improve  the  quality  of  the  match. 


8  Conclusions 

A  method  of  localization  and  positioning  in  an  indoor  environment  was  presented.  The  method 
is  based  on  representing  the  scene  as  a  set  of  2D  views  and  predicting  the  appearance  of  novel 
views  by  linear  combinations  of  the  model  views.  The  method  accurately  approximates  the 
appearances  of  scenes  under  weak  perspective  projection.  Analysis  of  this  projection  as  well 
as  experimental  results  demonstrate  that  in  many  cases  this  approximation  is  sufficient  to 
accurately  describe  the  scene.  When  the  weak  perspective  approximation  is  invalid,  either  a 
larger  number  of  models  can  be  acquired  or  an  iterative  solution  can  be  employed  to  account 
for  the  perspective  distortions. 

The method presented in this paper has several advantages over existing methods. It uses relatively rich representations; the representations are 2D rather than 3D, and localization can be done from a single 2D view only. The same basic method is used in both the localization and positioning problems, and a simple algorithm for repositioning is derived from this method. Future work includes handling the problem of acquisition and maintenance of models, developing efficient and robust algorithms for solving the correspondence problem, and building maps using visual input.


Figure 9: Matching the model to an image obtained by a relatively large motion. Perspective distortions can be seen in the table, the board, and the hanger at the upper right.

Figure 10: The results of applying the iterative process to reduce perspective distortions after one (left) and three (right) iterations.


Appendix

In this appendix we derive the explicit values of the coefficients of the linear combinations for the case of horizontal motion. Consider a point $p = (x, y, z)$ that is projected by weak perspective to three images, $P_1$, $P_2$, and $P'$. $P_2$ is obtained from $P_1$ by a rotation about the $Y$-axis by an angle $\alpha$, translation $t_m$, and scale factor $s_m$, and $P'$ is obtained from $P_1$ by a rotation about the $Y$-axis by an angle $\theta$, translation $t_p$, and scale $s_p$. The position of $p$ in the three images is given by

$$(x_1, y_1) = (x, y)$$
$$(x_2, y_2) = (s_m x \cos\alpha + s_m z \sin\alpha + t_m, \; s_m y)$$
$$(x', y') = (s_p x \cos\theta + s_p z \sin\theta + t_p, \; s_p y)$$

The point $(x', y')$ can be expressed by a linear combination of the first two points:

$$x' = a_1 x_1 + a_2 x_2 + a_3$$
$$y' = b y_1$$

Rewriting these equations we get

$$s_p x \cos\theta + s_p z \sin\theta + t_p = a_1 x + a_2 (s_m x \cos\alpha + s_m z \sin\alpha + t_m) + a_3$$
$$s_p y = b y$$

Equating the values for the coefficients in both sides of these equations we obtain

$$s_p \cos\theta = a_1 + a_2 s_m \cos\alpha$$
$$s_p \sin\theta = a_2 s_m \sin\alpha$$
$$t_p = a_2 t_m + a_3$$
$$s_p = b$$

and the coefficients are therefore given by

$$a_1 = \frac{s_p \sin(\alpha - \theta)}{\sin\alpha}$$
$$a_2 = \frac{s_p \sin\theta}{s_m \sin\alpha}$$
$$a_3 = t_p - \frac{t_m s_p \sin\theta}{s_m \sin\alpha}$$
$$b = s_p$$
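The closed-form coefficients can be verified numerically. The sketch below assumes the weak-perspective view model of this appendix (rotation about the vertical axis, a scale, and a horizontal translation) and checks that the combination $a_1 x_1 + a_2 x_2 + a_3$ reproduces the novel view. The function names and the sample motion parameters are illustrative, and $\sin\alpha$ must be nonzero, i.e. the two model views must differ by a rotation.

```python
import math


def view(point, angle, s, t):
    """Weak-perspective view after rotating by `angle` about the Y-axis, scaling by s, shifting by t."""
    x, y, z = point
    return (s * x * math.cos(angle) + s * z * math.sin(angle) + t, s * y)


def lc_coefficients(alpha, s_m, t_m, theta, s_p, t_p):
    """Coefficients (a1, a2, a3, b) with x' = a1*x1 + a2*x2 + a3 and y' = b*y1."""
    a1 = s_p * math.sin(alpha - theta) / math.sin(alpha)
    a2 = s_p * math.sin(theta) / (s_m * math.sin(alpha))
    a3 = t_p - t_m * a2
    b = s_p
    return a1, a2, a3, b


# Hypothetical point and motion parameters.
p = (1.0, 2.0, 3.0)
alpha, s_m, t_m = 0.4, 1.1, 0.5
theta, s_p, t_p = 0.25, 0.9, -0.3

x1, y1 = view(p, 0.0, 1.0, 0.0)             # first model view: (x, y)
x2, y2 = view(p, alpha, s_m, t_m)           # second model view
x_new, y_new = view(p, theta, s_p, t_p)     # novel view

a1, a2, a3, b = lc_coefficients(alpha, s_m, t_m, theta, s_p, t_p)
x_pred = a1 * x1 + a2 * x2 + a3
y_pred = b * y1
print(x_new, x_pred, y_new, y_pred)
```

Because the coefficients depend only on the motion parameters and not on the point, the same four numbers predict the novel view of every point in the scene.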
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